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Abstract 

We present and discuss the characteristics and performances, both in term 
of computational speed and precision, of a numerical code which numerically 
integrates the equation of motions of 'particles' interacting via Newtonian 
gravitation and move in an external galactic smooth field. The force evalua- 
tion on every particle is done by mean of direct summation of the contribu- 
tion of all the other system's particle, avoiding truncation error. The time 
integration is done with second-order and sixth-order symplectic schemes. 
The code, NBSymple, has been parallelized twice, by mean of the Computer 
Unified Device Architecture to make the all-pair force evaluation as fast as 
possible on high-performance Graphic Processing Units NVIDIA TESLA C 
1060, while the O(A^) computations are distributed on various CPUs by mean 
of OpenMP Application Program. The code works both in single precision 
floating point arithmetics or in double precision. The use of single precision 
allows the use at best of the GPU performances but, of course, limits the pre- 
cision of simulation in some critical situations. We find a good compromise 
in using a software reconstruction of double precision for those variables that 
are most critical for the overall precision of the code. The code is available 
on the web site astrowww.phys.uniromal.it/dolcetta/nbsymple.html 
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1. Introduction 



The study of systems where each particle interacts with each other is the 
aim of a huge number of scientific fields (e.g. molecular dynamics, stellar 
dynamics, etc.). 

The so called classical A^-body problem, in particular, is of great relevance in 
astrophysics. Actually, the classical gravitational (or Coulombian) A^-body 
problem is not solvable (in its general form) analitically, so it is compulsory 
relying on numerical simulations. The direct all-pairs approach is formally 
simple but very expensive in terms of CPU time. The CPU time grows pro- 
portionally to the number of distinct pairs in the A^-body system, N{N—l)/2, 
i.e., for large it scales as A^^, i.e. the computational complexity is O^N"^). 
Due to this, to reduce computational load one usually resorts on far-field 
approximations or other approximation techniques for the large scale contri- 
bution to the force on a given element (that we call 'particle', also in the 
case of a star in a stellar system), limiting direct summation to a resticted 
number of neighbours. 

In this paper we approach the study of the evolution of star clusters in our 
Galaxy, as A^-body systems where an external galactic gravitational field, 
represented as an analytical expression, is summed to the interaction among 
all the stars of the cluster. We evaluate the all-pairs component by direct 
summation, avoiding spurious approximations; in order to speed up compu- 
tations, we exploited a double-parallelization on CPUs and on the hosted 
Graphic Processing Units (GPUs), which are modern pieces of hardware ca- 
pable of very high computing speed. 

The platform we used is made by a 2 Quad Core Intel Xeon 2.00Gifz work- 
station and a GPU NVIDIA TESLA C1060. This GPU of the last generation 
supports both single-precision (SP, 32-bit) and double-precision (DP, 64-bit) 
fioating point arithmetic. At present, each of the processing units in the 
TESLA CI 060 contains one DP processor alongside the 8 SP processors; the 
TESLA C1060 GPU has 240 threads, meaning that 30 threads are at work 
when fully exploiting DP calculations. 

Parallelization on CPUs has been done by mean of OpenMP Application Pro- 
gram, supporting shared-memory parallel programming, while parallelization 
on the GPUs makes use of the NVIDIA native programming language CUDA 
(Computer Unified Device Architecture). 
The OpenMP parallelization is mainly devoted to speed up 
the time-integration, performed by 2"*^ and 6*^^ order symplectic algorithms. 
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and the evaluation of the acceleration due to the external potential, which 
are both O(A^) tasks, while CUDA was involved in the pairwise forces eval- 
uation, whose computational load scales as A^^. 

In order to check reliability, precision and speed of our composite code, we 
realized various versions of the same basic code. A comparison of these dif- 
ferent implementations allows us to evaluate how convenient is the usage of 
graphic cards coupled to multiple CPUs for these types of simulations. 



2. The iV-body problem 

In a system of A^ gravitating point-masses, the force on the i — th body 
of mass rrii and position vector in a cartesian reference frame is the sum 
of the individual forces, Fjj, due to all the other j — 1,2, N, j ^ i bodies 
as given by 

F.- = G^^(r,-rO, (1) 
such that the total force acting on i, Fj, is 

N N 

F.= J2 F., = Gm, J2 tSi^'^' (2) 

where we set rij = — r^. The anti-symmetry condition Fij — —Fji holds, 
implying the linear and angular momentum conservation laws for the isolated 
case. The resulting system of equations of motion, subjected to given initial 
conditions, is 

=Fi + Fext{ri) 
r^{0) =rio (3) 

ri(0) = TiO: 

where F^xt accounts for an external force, expressed by Fextii^i) = ^Uexti'^'i) 
if conservative (V is the usual gradient operator acting on the external force 
potential, Uext)- 

It is well known that the existence of just 10 constants of motion, indepen- 
dently of A", does not allow a general analytical solution for A" > 3. This 
obliges to rely on numerical methods, whose application is made ackward 
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by the contemporary presence of problems on both the small length scales 
(close gravitational encounters) and the large ones (gravity has no screen, 
so the contribution of every component of the system, even if distant, must 
be included for a correct force evaluation). This 'double divergence' of 
the classic newtonian interaction potential has two different consequences: 
i) close encounters correspond to an unbound force between colliding bodies 
(|Fjj| — )■ oo when |rjj| — )> 0) yielding to an unbound error in the relative 
acceleration; ii) the need of the complete force summation over the whole set 
of distinct N[N — l)/2 pairs in the system implies an overwhelming CPU 
charge, practically unaffordable even with most modern, quick CPUs. 

Problem i) is often cured by mean of the introduction of a 'softening' pa- 
rameter, e, in the interaction potential, which assumes the smoothed form 

= ^ /I 2- (4) 

The corresponding total force acting on the i — th particle is 

N 

In the latter sum the condition j ^ i is no longer needed, because Fa — 
if e 7^ 0. Note that the introduction of the softening parameter corresponds 
to substitute point masses with Plummer's spheres (Plummer 1911), where 
the mass rrii is distributed around the centre according to the density law 

^'^""^ " 47re3 (e2 + ^2)5/2- (6) 

The reader interested to a deeper discussion of the pair interaction mollifi- 
cation may refer to Sect. 2 of the Capuzzo-Dolcetta et al. (2001) paper. 
It is relevant recalling that, in any case, the chance of close encounters is 
larger for small values of N, when the "granularity" of the system is more 
significant. 

Throughout this paper we will use the softened interaction given by Eq. 5 
with e = 0.005(i?/A^^/2), where R/N^^^ is a measure of the average distance 
of a particle of the A'"-body system to its closest neighbour. 

Problem ii) is faced in many different ways; we do not want to cite here 
them all in order not to be too long. We just say that the common line is 
introducing averaging (mean-field) methods, dividing the force acting on a 
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particle and due to the rest of the system into a "large" scale, slowly varying, 
coarse-grain, contribution (Fis) and into a "small" scale, rapidly varying, fine- 
grain, contribution, which is represented as a summation limited to a set of 
n < N neighbouring particles, to give 



Of course, setting n — N (which means that the neighbouring sphere 
contains all the system particles) makes the approximated expression above 
equal to the correct direct summation; in this case F^^ is contributed by an 
external force, only. 

3. The AT-body code 

The main scientific aim of this paper is producing a high precision, fast 
and reliable code to study the evolution of stellar systems of the size of 
open clusters and globular clusters. Our goal is the study of their dynamical 
evolution taking into account both the internal mutual force and the effect 
of the external galactic field. 

The code generates, first, the initial conditions for the particles of 
the system, whose individual masses are chosen by a given mass spectrum. 
For the scope of this paper, which aims, mainly, at investigating the code 
quality and performances in different hardware and software environments, 
we gave, for the sake of simplicity, all the particles the same mass rrii = m 
(?' = 1,2,..,A^). The total mass, M = Nm, is assumed as mass unit. For 
the same purpose of simplicity we give particles an initial spatially uniform 
distribution within a sphere of given (unitary) radius, R, with velocities, also, 
uniformly distributed in direction and absolute values and rescaled, in their 
magnitude, to reproduce a given value of the virial ratio (we remind that the 
virial ratio is defined as Q = 2K/\Q\, where K and Q are, respectively, the 
system kinetic and potential energies; for a stationary system, Q = 1). Note 
that the further assumption G = 1 in the equations of motion implies that 
the 'crossing' time T = (G'M)~^/^i?^/^ is the unit of time. 

The forces due to the mutual interaction among stars in the cluster (in- 
ternal forces) are given by Eq. 5 and summed to the effect of the external 
galactic field. The force due to the external field (Fext, second term in Eq. 3) 
is accounted by an analytical expression for the Galactic potential as given 



n 
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by Allen and Santillan (1991). In this paper the authors consider the Galac- 
tic potential as given by three components: a bulge, a disk, and a halo. The 
bulge and the halo have a spherical symmetry, while the disk is axisymmetric. 

3.1. Time integration 

It is well known that ordinary numerical methods for integrating New- 
tonian equations of motions become dissipative and exhibit incorrect long 
term behaviour. This is a serious problem when facing iV-body problems, 
particularly when studying them on a huge time baseline. One possibility is 
using symplectic integrators. 

Symplectic integrators are numerical integration schemes for Hamiltonian 

systems which conserve the symplectic two-form dp A dq exactly, so that 
(q(0)) p(0)) ~^ p(''")) is a canonical transformation. They are char- 

acterized by time reversibility. For both explicit and implicit integrators, it 
was shown (Yoshida 1991) that the discrete mapping obtained describes the 
exact time evolution of a slightly perturbed Hamiltonian system and thus 
possesses the perturbed Hamiltonian as a conserved quantity. This guaran- 
tees that there is no secular change in the error of the total energy (which 
should be conserved exactly in the original flow) caused by the local trunca- 
tion error. If the integrator is not symplectic, the error of the total energy 
grows secularly, in general. 

Our code allows the choice of two different symplectic methods. One is the 
simple, classic, "leapfrog" method, which is 2"^'^-order accurate; the other is 
a more accurate G'^'-order explicit scheme whose coefficients are taken from 
the first column of the Table 1 (SI6A) of Kinoshita,Yoshida, & Nakai (1991), 
which leads to a conservation of energy for a factor fifty better than that with 
the other two possible sets of coefficients. Of course, the G^'^-order symplectic 
integrator is much slower than the leapfrog, requiring 7 evaluations of force 
functions per time step, like, for instance, in a 6*^-order Runge Kutta method. 



4. Implementation of the code 

4.1. Hardware and software 

As we said before, the direct evaluation of the forces among particles in 
an A'"-body system is quite a heavy task for computers, mainly because of 

the computation of the elements of the |r.y | (euclidean distance) array, whose 
number of distinct elements isA^(A^ — 1)/2 (due to the symmetry |rjj| — \rji\) 
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Figure 1: Flowcha^ of our basic code 



thus scalings as N"^. As a matter of fact, every distance computation requires 
the evaluation of a square root, which is an expensive computational task. 
Actually, every pair force vector, Fjj, requires 20 -i- 30 floating point opera- 
tions. This means that time integrations of iV-body systems, when extended 
over a time interval large enough to have a scientific relevance, overwhelm 
the power of every single CPU platform for N above few thousands. If one 
wants to keep direct summation, without relying on approximations like, for 
instance, mean-field techniques, the only possibility is resorting on parallel 
computing and/or dedicated machines. 

It is out of the purposes of this paper the discussion of parallelization of 
A^-body codes on large main frames, which is not a trivial task, for the slow 
decay with distance of the gravitational interaction. Here we deal with the 
problem of implementing an efficient A^-body integrator on our computing 
CPU+GPU platform. 

The particular workstation we used has two Quad Core Intel Xeon processors, 
each running at 2.00 GHz, 4GB DDR2 RAM at 667 MHz and two NVIDIA 

TESLA C1060 CPUs, connected to the host via two slots PCI-E 16x. 
NVIDIA TESLA C1060 has 240 processors, each of them has a clock of 1.296 
GHz, for a nominal SP fioating point peak performance of 933 GFlops/s (78 
GFlops/s in DP). 

Some authors have already faced the problem of implementing gravitational 
A^-body integrations on CPUs with different approaches (Barsdell et al. 
2009). Here we explain how we get our own implementation, leading to 
a code, which we call NBSymple (ackronym for NBody Symplectic integra- 
tor). Various versions of the same basic code have been realized. 
The first is fully serial (NBSympleA), i.e. it runs on a single processor. This 
code constitutes a 'unit of measure' of the performances of the various par- 
allel versions. 

In a second version of the code (NBSympleB) we implemented paralleliza- 
tion, with Open Multi-Processing (OpenMP) directives, of both the 0{N'^) 
pairwise interactions and the 0{N) calculations (i.e. the time integration 
and evaluation of the Galactic component of the force on the system stars) 
over the double Quadcore host. OpenMP is an application programming in- 
terface (API) that supports multi-platform shared memory multiprocessing 
programming in C, C-|— |- and Fortran on many architectures, including Unix, 
Linux and Microsoft Windows platforms. It consists of a set of compiler di- 
rectives, library routines, and environment variables that influence run-time 
behavior. 
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In the third version (NBSympleC) the all-pairs interactions (0(A^^) load cal- 
culations) are demanded to the NVIDIA TESLA CI 060 GPU, using Compute 
Unified Device Architecture (CUDA, see 

://www.nvidia.com/ohject/cudaJiomelUmf^ while all the remaining tasks 
are done by a single (or multiple) CPU. CUDA is a parallel computing archi- 
tecture developed by NVIDIA that extends C by allowing the programmer 
to define C functions, called kernels, that, when called, are executed n times 
in parallel by n different CUDA threads. CUDA programming language is 
basically a C programming language extended with a number of keywords. 
CUDA threads may execute on a physically separate device that operates as 
a coprocessor to the host running the C program. In this case, the rest of 
the code runs on a single CPU. 

In another (fourth) version of the program (NBSympleD) we again use CUDA 
to evaluate the 0{N'^) portion of the code (like in NBSympleC), while the 
0{N) computations (time integration and evaluation of the acceleration due 
to the Galaxy) are parallelized sharing work between all the eight cores of 
the host, using OpenMP in the same way done in NBSympleB. 
The last (fifth) implementation (NBSympleE) uses CUDA on one or two 
CPUs to evaluate the total force over the system stars, i.e. both the all-pairs 
component and that due to the Galaxy. In NBSympleE, the 0(N) computa- 
tions were shared into an OMP part (time integration) and into a GPU part 
(smooth galactic force contribution) to maximize the efficiency. 
Table 1 summarizes the above (main) characteristics of the various versions 
of the code. 



Code Version 


IPE 


OMP (8PEs) 


GPU 


NBSympleA 


N + N'^ 






NBSympleB 




N + N^ 




NBSympleC 


N 






NBSympleD 




N 


Ar2 


NBSympleE 




N 


N + N'^ 



Table 1: Synoptic table summarizing the computational complexity of the tasks performed 
by the various processing units as exploited by the different versions of our NBSymple code 
(see Sect. 4.1). 
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4-2. The code structure 

The A^-body integration scheme consists of two main parts. In the first, 
given positions and velocities of all the bodies, the forces between stars 
and that due to the overall, smooth. Galaxy distribution are evaluated. Con- 
sequently, in the second part, the code predicts the velocities and positions 
of the particles by mean of the previously calculated accelerations (see flow- 
chart in Fig. 1). 

In all of the five implementations of the code the advancing in time of ve- 
locities and positions of stars of the system is performed by the CPUs. This 
choice is motivated by that time advancing is more sensitive to round-off 
errors and the time needed for its computation grows only linearly with N, 
making convenient using the double precision representation available on 
the CPUs. Actually, TESLA C1060 supports double-precision fioating point 
arithmetics, but in this case performances decay significantly with respect to 
using single precision. Another problem is due to the bandwidth available for 
the data exchange between CPUs and CPUs. Performing time integration 
on CPUs would require sending and receiving a huge amount of data from 
CPUs to CPUs, with a significant decay of performance. 
We now give a short description of how the code implementations that use 
CUDA handle communication of data between the host (CPU) and the de- 
vice (GPU). 

Before we invoke the kernel we copy the positions of the N particles on the 
CPU's global memory; after the calling we take the accelerations, calculated 
on the GPU device, and send them to the CPU memory. The kernel and de- 
vice functions used to evaluate the forces are very similar to those described 
in Nyland et al. (2007). In this latter work the authors introduced the notion 
of a 'computation tile', as a squared pxp sub- array of the |Fjj|, N x N force 
array. To calculate interactions we need the knowledge of the positions of 
2p bodies (for details see Nyland et al. 2007). These 'body descriptions' are 
stored in the shared memory, which has little reading latency (4 clock cycles) 
respect to global memory (400-600 clock cycles) . To achieve optimal reuse of 
data, the computation of a tile is arranged in a way that the interactions in 
each row are evaluated in sequential order (updating the acceleration vector), 
whilst the various rows are evaluated in parallel. 

The interaction between a pair of bodies is implemented as an entirely serial 
computation. A tile is evaluated by p threads performing the same sequence 
of operations on different data. Each thread updates the acceleration on one 
body as a result of its interaction with p other bodies. They load p body 
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descriptions from the GPU device memory into the shared memory provided 
to each thread block in the CUDA model. Each thread in the block evaluates 
p successive interactions. The results of the tile calculation are the p updated 
accelerations. 

A thread block is defined as a collection of p threads that execute some num- 
ber of tiles in sequence. In a thread block, there are N/p tiles, with p threads 
computing the forces on p bodies (one thread per body). Each thread com- 
putes all the interactions for one body. 

The kernel is invoked on a grid of thread blocks to compute the acceleration 
of all the N bodies. Because there are p threads per block and one thread 
per body, the number of thread blocks needed to complete all N bodies is 
N/p, so we have a ID grid of size N/p. The result is a total of A^ threads 
that perform A^ force calculations each, for a total of A^^ interactions. 
The number of calculations performed by the device is A^^ . This is redundant 
respect to the actual number of distinct pairs in the system, A^(A^ — l)/2; 
by the way, the limitation to A^(A^ — l)/2 of the number of force evalua- 
tions requires a heavier load of internal communication and synchronizations 
(Belleman et al. 2008), with a resulting net performance decrease. 
The dimension of the tile and the number of thread blocks depend on the 
number of particles involved in the simulation. There are 30 multiprocessors 
on the NVIDIA TESLA C1060, so the block dimension (p) must be such that 
N/p is 30 or larger to use all the multiprocessors available. 

The entire part of the program that performs the time integration of the 
system is enclosed in a parallel section, through OpenMP directives. We 
shared work between all the CPUs available declaring variables in an ap- 
propriate way, and using properly the OpenMP directives to parallelize the 
various 'for-cycles ' that perform the time integration and the computation 
of the galactic contribution to the accelerations (both scaling as A^). The 
copy of the data from and to the GPU and the kernel are invoked by the 
master thread. 

As we said above, our GPUs allow us the use of DP floating point rep- 
resentation. To reduce the bandwidth needed to transfer data between 
CPUs and GPUs we constructed C structures of four DP variables, emu- 
lating CUDA's float4. As we see in the next Section the performances of the 
program decrease a lot. On the other side, using the 6*'*order integrator we 
reach a precision 7 or 8 orders of magnitude better than that reached with 
single-precision arithmetics. 
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Finally, we realized another version of our program to work on n GPUs of 
the same type, besides sharing the job on the 8 CPUs of the workstation. 
Through OpenMP directives we created n threads. Each thread communi- 
cates with one of the device and copies the positions of all the particles on 
its global memory. Moreover each of the n threads, launches a kernel on one 
of the n available devices. So we have n kernels, each of which calculates the 
accelerations for N/n particles (in our case n = 2). After this calculation, 
the instanced thread copies the results on the host and the number of threads 
is reset to 8 to continue the time integration. 



5. Results 

5.1. Performances and accuracy of the codes 

We tested the performances in terms of both precision and speed of our 
code in its various versions. As outlined in Sect. 3, the initial conditions 
for the A'^-body system are picked from a uniform phase-space distribution 
of equal mass particles, with an initial virial ratio Q = 0.3. The particle- 
particle interaction potential is softened (see Eq. 4 and 5) and the system 
moves in the Galaxy, represented with the Allen and Santillan (1991) model, 
on a quasi circular, planar orbit at about the same galactocentric distance of 
the Sun. 

To evaluate the quality of integration we checked the time behaviour of 4 
quantities: the total energy {E) and the three components of the angular 
momentum (L) vector. The energy is a true constant of motion, while L 
should vary in time due to the external field torque, L = Me^,*, being a 
constant only in case of either isolated system or system embedded in an 
external spherically symmetric (respect to the system barycentre) potential. 
In any case, the external torque is expected to be little due to the small 
cluster size, as we checked by the computation of the total torque via direct 
summation, 

N 

Me^t = X Fea:t(ri) (8) 

i=l 

The energy and angular momentum conservation is evaluated through com- 
putation of 

^ E{t) - E{0) 

\m\ \m\ ' ^' 
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and 

AL L(t) - L(0) 

|L(0)| |L(0)| ■ ^ ' 

Of course the quality of both energy and angular momentum conserva- 
tion depends on the code version, the best conservation performances being 
achieved with the use of double-precision and 6*'* order symplectic integra- 
tion. Table 2 gives the run with time of the average errors per step in the 
case of the symplectic &^ order time integration with DP arithmetics. 

Fractional energy conservation per cycle is in the range [1.4 x 10^^"^, 8.8 x 
10~^^] , while the x and y components of the angular momentum are conserved 
in the range [1.4 x 10"^^ 6 x 10"^^]. 



t 


fl4fi) 








1(^1) 




|/|AL,|u 
l\ |L„| 1/ 




1.25 


8.83x10" 


-14 


3.20x10" 


■13 


1.88x10" 


■13 


3.54x10" 


■17 


5 


1.44x10- 


■14 


8.26x10" 


■12 


1.43x10" 


■12 


1.13x10" 


■17 


10 


1.37x10" 


-14 


1.39x10" 


■11 


3.12x10" 


■12 


4.53x10" 


■17 


15 


6.87x10" 


-14 


3.42x10" 


■12 


4.75x10" 


■11 


0.00x10 




20 


3.45x10" 


-14 


3.90x10" 


■12 


1.76x10" 


■10 


1.41x10" 


■17 


25 


8.42x10" 


■14 


9.32x10" 


■11 


2.42x10" 


■10 


3.68x10" 


■17 



Table 2: Percentage relative variations in energy and angular momentum per time step 
in the time interval [0,25], which corresponds to about 0.4 orbital revolution around the 
galactic center. The symplectic 6*'' order method is used in double-precision mode. The 
errors are obtained averaging the error values in 6 time intervals composed by 2500 time 
steps each . 



N 


480 


30720 


1536000 


NBSympleC 


5.97 


409 


423 


5) 


3.36 


21.9 


22.0 


NBSympleE 


11.29 


797 


846 




3.37 


43.8 


44.0 



Table 3: Performances in GFLOPs/s of two versions of the NBSymple code (C and E) 
in their SP (first and third row) and DP (second and fourth row) modes. Note that 
NBSympleE exploits the power of two TESLA C1060 CPUs. 

To have a reliable evaluation of the computing times, we ran the codes 
up to 0.57^ using a (constant) time step At = 5 x 10~^T, this means the code 
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Figure 2: The (averaged over 1000 cycles) solar time (in seconds) spent for one leap-frog 
integration step in single precision mode, as a function of N. Line with empty squares: 
NBSympleA code. Line with filled triangles: NBSympleB. Line with crosses: NBSympleC. 
Line with filled squares: NBSympleD. Line with stars: NBSympleE with a single GPU. 
Line with empty triangles: NBSympleE with two GPUs. 
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Figure 3: As in Fig. 2, but for the double-precision codes. The hne with filled squares is 
not reported because it is almost identical to that with crosses. 
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Figure 4: As in Fig. 2, but for the sixth-order, double-precision codes. The line with filled 
squares is not reported because it is almost identical to that with crosses. 



16 



I I I I \ \ I I I I \ ! I I I I I I I I I 

2.5 3 3.5 4 4.5 

Log N 



Figure 5: The (averaged over 1000 cycles) solar time (in seconds) spent for a single inte- 
gration step as a function of N by the NBSympleE codes in the leapfrog, hardware double 
precision mode (line with triangles) and in the leapfrog, software double precision mode 
(line with circles). 
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FiguK^ 6: Configiirations of the N = 15, 360 simulated cluster moving on the Milky Way 
simmetry plane, at various times, from t=0 to t=60. The cluster motion is counterclock- 
wise. 
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Figure 7: Snapshot of the cluster on quasi-circular orbit at the time t = 29.4 (i.e. at about 
half of its first revolution around the galactic centre). 
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Figure 8: Snapshot of the cluster on quasi-circular orbit at t = 84 (i.e. at about 1.4 
revolutions around the galactic centre). 
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Figure 9: Snapshots of the A'' = 15, 360 cluster plunging on quasi-radial orbit through the 
massive central object (M^/j/M = 10) at the times labeled in the various panels. 
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480 


30,720 


1,536,000 


task 


C 


D 


E 


C 


D 


E 


C 


D 


E 


1 


23.5 


36.4 


81.4 


58.4 


89.6 


94.2 


98.56 


99.74 


99.87 


2 


70.3 


36.0 




38.0 


9.0 




1.31 


0.18 




3 


4.5 


16.8 


13.8 


2.7 


0.9 


4.4 


0.10 


0.06 


0.10 


4 


1.7 


10.8 


4.8 


0.9 


0.5 


1.4 


0.03 


0.02 


0.03 



Table 4: Time profiling (in percent) of NBSymple in 3 of its versions (as labeled by the 
capital letter) for the 4 main tasks, labeled as in Fig. 1, for the three values of N given in 
the table heading. 



task 


ATi 


N2 


N3 


1 


94.0 


99.03 


99.9810 


2 


5.3 


0.90 


0.0165 


3 


0.5 


0.06 


0.0018 


4 


0.2 


0.02 


0.0007 



Table 5: Profiling for NBSympleC in its leapfrog, double precision, mode. The entries 
are the time fractions (in percentage) spent by the code in performing the 4 main tasks 
(numbered as in Fig. 1 and Tab. 4) for three values of the number of particles: A''i = 480, 
N2 = 30, 720, and N3 = 1, 536, 000. 



was run for 1000 cycles. This way, we have a robust average value of the 
time spent per cycle by the various code versions in each task by the simple 
ratio of the total time to the number (1000) of cycles. 

Fig. 2 shows the (average) time required by the various versions of 
NBSymple for a single time step integration with the leapfrog method. For 
a system of 480 < N < 1,536,000 particles (the largest value of used is 
quite representative of the number of stars in a real, popolous, globular clus- 
ter). In the case of the fully serial code on a single PE (NBSympleA) and of 
the OpenMP parallel code on the double quadcore host (NBSympleB), the 
test is limited to a maximum value of N=7680 because larger values require 
too long CPU times. Their computing times show clearly the expected 
behaviour. The ratio (version A to version B) of their computational times 
indicates a speed-up about ~ 3.6 (the best obtainable being 8, as the number 
of PEs). 

The most immediate sketch of the advantage in using the GPU is by compar- 
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task Ni N2 N3 



1 
2 
3 
4 



85.5 99.14 99.9819 

13.1 0.80 0.0163 

0.9 0.04 0.0012 

0.5 0.02 0.0006 



Table 6: As in Tab. 5, for NBSympleC in its sixth order, double precision, mode. 



ing the curve with empty squares to that with crosses, this latter referring 
to NBSympleC, that is the version of the code where the 0{N^) part of the 
code is performed by a single TESLA C1060 GPU. The speed-up given by 
the GPU ranges between ~ 15 (for N - 480) and ~ 330 (for = 7680). 
The NBSympleE code which exploits two TESLA GPUs becomes two times 
faster than when using a single GPU, showing an almost perfect efficiency 
when A^ is large enough to overcome the overhead, as evident in Fig. 2 for 
A^ ^ 10"^. The fastest version available (the one which uses two TESLA 
GPUs) requires one minute to accomplish a full integration time step for 
A" = 1, 536, 000; taking into account the need of at least 2 x 10^ cycles in a 
crossing time to have an acceptable precision, this means about 33 hours (1.4 
days) to simulate by direct summation the evolution over one internal cross- 
ing time of a globular cluster moving in the galactic potential, and about 211 
hours (8.8 days) to follow a complete revolution of the system around the 
galactic center, if the cluster is moving quasi-circularly at the Sun distance 
from the centre. 

Fig. 3 reports compared performances of the versions of NBSymple in double- 
precision mode. The basic, single PE, version decays in performances of a 
factor ~ 1.5 respect to the SP mode. The ratio between time spent by the 
single PE version and the OpenMP version remains similar to the SP case 
in Fig. 2. Of course, the best performances are achieved by the full GPU 
NBSympleE code, although the performance degradation respect to the SP 
version is of a factor ~ 18, as expected due to the characteristics of the 
TESLA C1060 architecture. 

The symplectic 6*'* order, double precision performances are shown in Fig. 

4. We decided to investigate the 6*^* order method performances in its DP 
mode, only, because the use of high order symplectic integration is done to 
keep high precision, which would be lost if using SP arithmetics. The gain 
in precision is, of course, paid by a significant loss in speed. At this regard. 
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a useful indication comes from comparing Fig. 3 and Fig. 4 with Fig. 1. 
Considering the, fastest, NBsympleE version (the one exploiting 2 CPUs) we 
see that the ratios between computing times per cycle, in the most significant 
range covered by all the figures (4 < LogA^ < 4.5) scales by 2 factors of 10 
for LogN — 4 (the time required for a time cycle by the 6*'* order DP code is 
about 10 times longer than by the 2"'' order DP code, which, in its turn, is 
about 10 times longer than the 2""^ order SP code) and by a factor of 20 and 
a factor of 12 for LogA^ = 4.5. The performance degrading when using the 
DP instead of SP with the 2"°' order leap-frog integrator is mainly explained 
by the specific architecture of the TESLA C1060 GPU (1 DP processor per 
8 SP processors), while the further degrading when using the 6*^ order in- 
tegrator is explained by both the 8 force evaluations per cycle needed and 
the repeated transfers from the host to the GPU device memory and back (a 
total of 16 per particle). 

It is relevant noting, however, that the need of a high order symplectic 
code is mostly for celestial mechanics apphcations (small values of N) where 
the CPU time is dedicated to follow a long integration time (thousands of 
orbital periods) rather than to the force computation which is (relatively) 
unexpensive because is small. 

5.2. Hardware and software double precision arithmetics 

The study of A"-body system evolution is, in many cases, characterized 
by the requirement of high precision in the computations. This for two main 
reasons, that are different for small and large values of A". 

For small values of A^ (celestial mechanics) the high precision in the ac- 
celeration calculation is required to avoid secular growth of the error over 
the huge time extension of the integration; for large values of A^, when time 
integration is not too extended, precision is mainly required to reduce the 
so called problem of cancellation of terms, i.e. the error due to difference 
between numbers that are very similar. This happens when evaluating the 
distances of bodies in the system in an inertial reference frame, usually cen- 
tered at the center of symmetry of the external potential where the system is 
orbiting, in the case when R << d, where R and d are the sizes of the system 
and its distance to the origin of the reference frame, respectively. The correct 
approach to the problem of error reduction is via coupling a symplectic, high 
order, integration scheme to double or quadruple (in the case of celestial me- 
chanics applications) precision arithmetics. At present, the NVIDIA TESLA 
C1060 GPU supports double-precision 64bit fioating point arithmetics in a 
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limited way, because each of the processing units contains one DP processor 
alongside the 8 SP processors. This means that there are only 30 DP units 
available if doing DP only calculations. The relevant performance degrada- 
tion convinced us to seek for a possible solution to join precision and speed. A 
straightforward way to emulate DP with single precision artihmetics is done 
transferring the DP representation of every particle space coordinate to the 
GPU memory where it is shared into two SP allocations, one where the most 
significant digits are stored and and another for the lesser significant digits. 
When forces have to be computed by the GPU, it joins the two SP memory 
allocations to give a good DP emulation (14 digits instead of 16) yielding to 
a satisfactory round-off error in the, delicate, evaluation of coordinate dif- 
ference. Such a representation of DP is known as 'double-single' precision 
(DSF0). 

The loss in performance has been checked to be acceptable, and the imple- 
mentation is not dependent on the particular hardware used. The practical 
implementation of DSP in our code was done by mean of the SAPPORO 
library (Gaburov et al. 2009). 

Fig. 5 shows a performance comparison between the DP and DSP codes us- 
ing the leapfrog integrator. It shows the (average) time spent for integrating 
the system for a single time step, as function of the number of bodies. It is 
evident that the DSP precision allows to integrate a system much faster than 
the DP precision. 

Finally, Tab. 7 shows the total energy conservation for the leapfrog and G*'^ 
order methods after 1000 time steps. 

5.3. Code time profiling 

In order to see whether and where it is possible to work to improve code 
performances, we checked the time spent by the versions of the codes that 
actually run on GPUs (versions C, D and E) in their different main tasks, 
which are essentially defined by the flow-chart in Fig. 1 and numbered from 
1 to 4. The profiling is resumed in Tabs. 4, 5 and 6. As expected, the 
pairwise force calculation (task 1) requires a longer time at increasing A^, 
rising to more than 99.8% for the largest A^. Only the version C of the 
code (the one which gives to a single CPU the O(A^) computational load) 



^http://crd. lbl.gov/ dhbailey/mpdist/ 
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Leapfrog 


Sixth Order 


N 


\AE/E,\ 


DP 


\/^E / Eq\dsp 


AE/Eq DP 


\^E / Eq\dsp 


480 


3.39x10 


-13 


6.68x10- 


-10 


3.09x10-15 


1.15x10- 


-10 


960 


1.57x10 


-10 


5.93x10- 


-11 





5.10x10" 


-11 


1920 


6.75x10 


-13 


1.90x10- 


-11 


3.04x10-15 


2.11x10" 


-11 


3840 


8.60x10 


-13 


1.18x10" 


-11 


8.10x10-1° 


1.19x10" 


-11 


7680 


1.12x10 


-12 


7.32x10- 


-12 


1.01x10^15 


4.64x10" 


-12 


15360 


1.18x10 


-12 


2.73x10" 


-13 


2.83x10-15 


5.12x10" 


-12 



Tabic 7: Relative errors in energy evaluated over 1000 time steps as AE/Eq = 
[£;(1000Ai) - E{0)]/E{0). 



spends a relatively significant fraction of time in other tasks than 1. Due to 
that version E performs both pairwise and external force calculations on the 
GPU, no distinction is possible between tasks 1 and 2, and this explains the 
absence of values in the task 2 row for this code version. 

6. Some simulation tests 

6.1. Quasi circular cluster orbits in the Milky Way 

In Figs. 6, 7 and 8 we display some snapshots of the projections onto 
the X — y coordinate plane of a cluster composed by iV = 15, 360 equal mass 
stars moving very close to the Milky Way plane of simmetry on a quasi 
circular orbit with radius ~ 8 kpc, simulated with NBSymplE using two 
TESLA C1060 GPUs. Fig. 6 shows a clear, quick development of a tidal 
tail in less than one orbital period (here, one orbital period is about 60T, 
where T is the internal crossing time, we adopted as time unit). Fig. 7 is a 
zoom of the cluster configuration at about half revolution around the galactic 
centre: note the evident barlike structure in the inner zone. After about 1.5 
revolutions around the galactic centre, the cluster shows two extended tails 
along its orbit, with two clumps, one in the leading and one in the trailing 
tail. These clumps have already been noted in previous high precision N- 
body simulations of globular clusters moving in an external field and their 
explanation is found in Capuzzo-Dolcetta et al. (2005) and Di Matteo et al. 
(2005) as due to a local deceleration of the stellar motion, causing an effect 
similar to a "motorway traffic jam" (essentially, if we compare the stellar 
motion to a fluid stream, for which the continuity equation holds, along the 
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tail an overdensity occurs whenever V • v < 0, where v is the colective stellar 
velocity). It is relevant noting that similar clumps have actually actually 
been observed in real globular clusters, the most know example being Pal 5 
(see Odenkirchen et al. 2003). 

6.2. Quasi radial cluster-massive black hole collisions 

For the sake of testing of the capabilities of our code we ran some "stiff' 
simulations, i.e. those of the face-on collision of a star cluster with a massive 
black hole sited at the centre of the galaxy, during its quasi radial motion 
in the central part of the Galaxy. These simulations, at varying the mass 
ratio between the black hole (represented as a further massive point) and the 
star cluster and at varying the number of stars in the cluster have a great 
astrophysical interest because they deal with the process of strong interac- 
tion with a massive central object that orbitally decayed globular clusters 
may have actually suffered in their motion in the parent galaxy. Apart from 
this scientific relevance, these simulations constitute a serious and difficult 
test for an N body code to face. Actually, the close interaction of a star 
system composed by point masses with a single body (the "black hole") 
much heavier than the individual cluster objects is hard to be followed by a 
numerical code because the very large acceleration induced by the massive 
black hole on passing-by cluster stars may cause an exceedingly large energy 
error. The energy error can, partly, be controlled by a proper reduction of 
the time step which cannot be, however, reduced below a threshold under 
which cumulation round-off error and CPU time get exceedingly large. The 
difficulty of such simulation grows at reducing the "coUision" impact param- 
eter b, i.e. the minimum distance between the cluster barycenter and the 
black hole during the cluster oscillatory motion around the massive object. 
To maximize the sharpness of collision we chose almost radial trajectories 
(6 ~ 0) for the cluster moving in the inner part of the Galaxy. A summary 
of results of our simulations at varying the black hole to cluster mass ratio 
is given in Table 8. Fig. 9 shows the motion of the cluster on the x, y plane 
at various times, showing the very fast development of the cluster "arc" -like 
shape around the apocenter. 

7. CODE AVAILABILITY 

We make available to the community the basic version of our code. The 
web site where the code, as well as instructions for running it plus some other 
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Mbh/M 


AE/Eo 


AL/Lo 


0.1 


3.06x10-^ 


5.90x10-4 


1 


7.75x10-^ 


1.06x10-2 


50 


9.23x10-5 


2.65 


105 


5.63x10-3 


3.91 



Table 8: Relative variations in energy and angular momentum over the time interval [0,4], 
which corresponds to about 6.4 complete radial oscillation through the galactic center of 
an = 15, 360 star cluster. The ratio between the black hole mass and the cluster mass is 
Mbh/M. The motion starts with cluster barycenter at {xo,yo,zo) = (10,0,0) with initial 
velocity (ioj 2/o, ^o) — (0, 0.OOlWc, 0), where Vc is the local circular velocity. The number of 
time steps is 4 x lO''; the total CPU time ~ 9.1 x 10^ sec. 



useful information, are found is: 
astrowww.phys.uniromal.it / dolcetta/nbsymple.html 
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